Deep convolutional neural networks (DCNNs) trained on a large number of
images with pixel-level annotations, or on a combination of strongly labeled
and weakly labeled images, have recently become the state of the art in
semantic image segmentation, with significant performance improvements.

However, the very invariance properties that make DCNNs good at
high-level tasks such as classification also limit the visual delineation
capacity of deep learning techniques. Recent approaches address this
problem with a Conditional Random Field (CRF) based graphical model in two
ways:

1. Adding a CRF-based probabilistic graphical model as a post-processing
step for the pixel-level classification [3, 8].

2. Integrating the graphical model as a part of the CNN, which makes
end-to-end learning with the usual back-propagation possible without the
need for post-processing [11].

In either case, the final pixel-level classification accuracy and efficiency
remain highly dependent on the inference step of the image-based CRF [5]
involved, where fast approximate MPM inference is performed using cross
bilateral filtering techniques within a mean-field approximation framework.
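
As a point of reference, the mean-field update that underlies this kind of
inference can be sketched in a few lines of Python. The sketch is a naive
O(N^2) illustration with a single Gaussian kernel and a Potts compatibility,
not the filter-based implementation of [5]; the function names, the `weight`
and `bandwidth` parameters, and the synchronous update schedule are
assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gaussian_kernel(f_i, f_j, bandwidth):
    """Gaussian similarity between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(f_i, f_j))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))


def mean_field(unary, features, weight=1.0, bandwidth=1.0, iterations=5):
    """Naive mean-field inference in a fully connected CRF (assumed setup).

    unary[i][l] : cost of assigning label l to variable i (lower is better)
    features[i] : feature vector of variable i (e.g. position and colour)
    Returns approximate marginals q[i][l].
    """
    n, num_labels = len(unary), len(unary[0])
    # Initialise the marginals from the unary costs alone.
    q = [softmax([-u for u in row]) for row in unary]
    for _ in range(iterations):
        new_q = []
        for i in range(n):
            # Message passing: kernel-weighted sum of all other marginals.
            msg = [0.0] * num_labels
            for j in range(n):
                if j == i:
                    continue
                k = gaussian_kernel(features[i], features[j], bandwidth)
                for l in range(num_labels):
                    msg[l] += k * q[j][l]
            # Potts compatibility: label l is penalised by the mass that
            # similar variables place on every other label.
            total = sum(msg)
            scores = [-unary[i][l] - weight * (total - msg[l])
                      for l in range(num_labels)]
            new_q.append(softmax(scores))
        q = new_q  # synchronous update of all variables
    return q
```

With three variables on a line, an ambiguous middle variable is pulled towards
the label of its two confident neighbours. The double loop over variables is
the step that filtering-based methods replace with a fast approximation.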

Alvarez et al. [1] demonstrate that performing inference on all test
images at once in a dense CRF yields better results than inferring one
image at a time, at no additional computation cost compared to performing
segmentation sequentially on individual images. Note that the dense CRF [5]
achieves good results with only unary and pairwise terms. This
fully connected pairwise model is more expressive than its 4- or
8-connected random field counterparts. Yet, it lacks the ability to handle
higher-order terms. Models [4, 6, 7] using higher-order terms, such as
label consistency over large regions (pattern-based potentials) and global
co-occurrence relations (co-occurrence potentials), are shown to be more
expressive for the object class segmentation task. Filter-based inference
for those higher-order terms is formulated in [10], which enables
significant speed-up compared to the graph-based methods [4, 6, 7]. Yet, it
still needs to take temporal consistency into account when applied to
co-segmentation or video semantic segmentation.

We explore the efficiency of the CRF inference module beyond image-level
semantic segmentation. The key idea is to combine the best of two worlds:
semantic co-labeling and more expressive models. Similar to [1], our
formulation enables us to perform inference over ten thousand images within
seconds. On the other hand, it can handle higher-order clique potentials
similar to [10], in terms of region-level label consistency and context in
terms of co-occurrences. We follow the mean-field updates for higher-order
potentials similar to [10] and extend the spatial smoothness and appearance
kernels [5] to address video data, inspired by [1], thus making the system
amenable to effective video semantic segmentation.
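
A minimal sketch of how the appearance and smoothness kernels of [5] might be
extended with a temporal coordinate is given below. The exact kernel used in
[1] and in the proposed method is not specified here, so the feature layout
(x, y, t, r, g, b), the shared temporal bandwidth `theta_tau`, the function
name, and all default values are assumptions.

```python
import math


def video_pairwise_kernel(f_i, f_j, w_app=1.0, w_smooth=1.0,
                          theta_alpha=30.0, theta_beta=10.0,
                          theta_gamma=3.0, theta_tau=5.0):
    """Pairwise kernel between two video pixels (illustrative assumption).

    f_i, f_j : feature tuples (x, y, t, r, g, b)
    The appearance term couples position, time and colour; the smoothness
    term couples position and time only.
    """
    x_i, y_i, t_i = f_i[:3]
    x_j, y_j, t_j = f_j[:3]
    d_xy = (x_i - x_j) ** 2 + (y_i - y_j) ** 2
    d_t = (t_i - t_j) ** 2
    d_rgb = sum((a - b) ** 2 for a, b in zip(f_i[3:], f_j[3:]))

    # Temporal distance is scaled by its own bandwidth in both terms.
    temporal = d_t / (2.0 * theta_tau ** 2)
    appearance = math.exp(-d_xy / (2.0 * theta_alpha ** 2)
                          - d_rgb / (2.0 * theta_beta ** 2)
                          - temporal)
    smoothness = math.exp(-d_xy / (2.0 * theta_gamma ** 2) - temporal)
    return w_app * appearance + w_smooth * smoothness
```

Under these assumptions, two pixels with the same position and colour are
coupled more strongly in adjacent frames than in frames far apart, which is
the property that encourages temporally consistent labels.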


Figure 1 shows some qualitative results of semantic segmentation on the
CamVid video dataset [2]. In this particular experiment, we used the
TextonBoost [9] unary potentials for easy comparison with other recent
methods. Video-level Dense-CRF [1] shows improved temporal consistency
over the frame-level operation [5] (the preceding row in Figure 1) without
additional time overhead. For pattern-based potentials, we use three
different superpixel segmentations obtained by varying the parameters of the
mean-shift algorithm. Frame-level Dense-CRF with this P^n-Potts model [10]
achieves almost the same quality as the previous slow graph-cut based
inference method [6], but lacks temporal consistency. The proposed
video-level Dense-CRF with the P^n-Potts model shows improved temporal
consistency over the frame-level operation (the preceding row in Figure 1)
without additional time overhead. Video-level Dense-CRF [1] and the proposed
method perform inference on 50 frames at once. On this dataset, with
TextonBoost unaries, our proposed method achieves 8% higher accuracy than
[1] by virtue of the P^n-Potts model, and 1.5% higher accuracy than [10],
without additional time overhead, by virtue of co-labeling. CNN feature
classification yields improved unary potentials compared to the unaries
provided by TextonBoost. Analyzing the final video semantic segmentation
accuracy using CNN-based unaries and the proposed Dense-CRF with the
P^n-Potts model remains future work.
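
For illustration, the P^n-Potts cost over superpixel cliques from several
segmentations can be written as follows. This is a sketch under the usual
definition of the model (a clique pays a label-dependent cost `gamma[l]` when
all of its pixels take label l, and a larger cost `gamma_max` otherwise); the
function names, cost values, and toy segmentations are hypothetical, and the
mean-field updates of [10] are not shown.

```python
def pn_potts_cost(clique_labels, gamma, gamma_max):
    """Cost of a single clique under the P^n-Potts model.

    clique_labels : labels of the pixels inside one superpixel
    gamma         : dict mapping a label to the cost of a consistent clique
    gamma_max     : cost of an inconsistent clique (assumed >= all gamma)
    """
    first = clique_labels[0]
    if all(label == first for label in clique_labels):
        return gamma[first]
    return gamma_max


def higher_order_energy(pixel_labels, segmentations, gamma, gamma_max):
    """Sum of P^n-Potts costs over the superpixels of several segmentations.

    pixel_labels  : label of every pixel, as a flat list
    segmentations : list of segmentations; each gives a superpixel id per pixel
    """
    total = 0.0
    for segmentation in segmentations:
        # Group the pixel labels by superpixel id.
        cliques = {}
        for pixel, superpixel in enumerate(segmentation):
            cliques.setdefault(superpixel, []).append(pixel_labels[pixel])
        for clique_labels in cliques.values():
            total += pn_potts_cost(clique_labels, gamma, gamma_max)
    return total
```

Using several segmentations, as in the experiment above, means a pixel belongs
to several overlapping cliques, so a labeling is penalised whenever it breaks
consistency in any of them.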

Figure 1: Qualitative results on the CamVid dataset [2]. From top to bottom:
input frames; unary potentials from TextonBoost classifier scores [9];
frame-level Dense-CRF [5]; video-level Dense-CRF [1], which shows improved
temporal consistency over the frame-level operation (previous row) without
additional time overhead; frame-level Dense-CRF with the P^n-Potts model
[10]; the proposed video-level Dense-CRF with the P^n-Potts model, which
shows improved temporal consistency over the frame-level operation (previous
row) without additional time overhead; frame-level slow graph-cut based
inference with the P^n-Potts model; and the ground truth labels.






